Monitoring and prediction of an epidemic outbreak using 
syndromic observations 



Alex Skvortsov and Branko Ristic 

Defence Science and Technology Organisation 
506 Larimer Street, Melbourne, VIC 3207 
Australia 



Abstract 

The paper presents an algorithm for syndromic surveillance of an epidemic outbreak 
formulated in the context of stochastic nonlinear filtering. The dynamics of the epidemic 
is modeled using a generalized compartmental epidemiological model with inhomogeneous 
mixing. The syndromic (typically non-medical) observations of the number of infected 
people (e.g. visits to pharmacies, sale of certain products, absenteeism from work/study 
etc.) are used for estimation. The state of the epidemic, including the number of infected 
people and the unknown parameters of the model, are estimated via a particle filter. 
The numerical results indicate that the proposed framework can provide useful early 
prediction of the epidemic peak if the uncertainty in prior knowledge of model parameters 
is not excessive. 

Keywords: Emerging epidemics, bio-terrorism, inhomogeneous mixing, social network, 
mathematical biology, nonlinear filtering, particle filter, imprecise likelihood. 



1. Introduction 

Epidemics can impose significant challenges on modern societies, not only by affecting 
the health of the general population, but also by causing negative trends in the economy 
(medical treatments, absenteeism from work, missed business opportunities, etc). The 
ongoing epidemics of AIDS, tuberculosis and the recent outbreaks of SARS and HlNl 
(swine fiu) provide some revealing examples. In the absence of an effective cure against 
many diseases, the best approach to mitigate an epidemic outbreak (malicious or natural) 
resides in the development of capability for its early detection and for prediction of its 
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further development [s^, Q|, [l^, Q]. Such a capabihty would allow making 

any countermeasures (quarantine, vaccination, medical treatment) much more effective 
and less costly 



Syndromic surveillance is refereed to as systematic collection, analysis, and interpre- 
tation of public health data for the purpose of early (and often cost effective) detection of 
an epidemic outbreak, its continuous monitoring, and timely response by public health 
agencies 3^, [4^, [? ]. The rationale for this method rests on a resalable assumption 
that a spread of an infectious disease is usually associated with the measurable changes 
in the social behaviour. 

It is important to highlight that although the measurements (data streams) for syn- 
dromic surveillance often rely on the medical observations (the visits to medical practi- 
tioners, occurrence of particular symptoms, the number of diagnosed cases, etc) they are 
not restricted to the pure medical data. Recent studies [l^ , [sil , [3| j have demonstrated 
that "non-medical" sources of syndromic data streams, such as the absenteeism from 
work/school, the pharmaceutical sales, Internet queries. Twitter messages, and alike, 
can enable one to draw important conclusions regarding the epidemic state in the com- 
munity. The 'Google Flu' project [l4j (flu-related searches in Google), is a well publicised 
example of this approach. 

The algorithms for syndromic surveillance and its practical implementation have 
recently attracted significant attention by scientists and practitioners; there is a vast 
amount of literature devoted to this topic (for more comprehensive review see jss! ] , [4^ 
and refs therein). In general, all algorithms applied in this area can be divided into 
two main groups, the data mining methods and the information fusion (also known as 
data assimilation) methods. Data mining is primarily concerned with the extraction of 
patterns from massive amounts of raw data without using dynamic models of the un- 
derlying process (i.e. epidemic spread) [l^ . Information fusion algorithms, on the 
contrary, strongly rely of mathematical models: in this case, the dynamic model of an 
epidemic outbreak and the measurement model of a particular syndromic data stream 
III; S ; l^ll- Evidently the accuracy of the information fusion algorithms is significantly 
determined by the fidelity of the underlying models. 

This paper presents a study of a recursive information fusion algorithm for syndromic 



surveillance. The problem is formulated in the Bayesian context of stochastic nonlinear 
filtering and solved using a particle filter. While this problem formulation and the par- 
ticle filter implementation have been considered earlier, see 17 1, [2^, 25 1, 35 1, this 



paper introduces several novelties. In order to overcome the limitations of the standard 
"compartment" model of epidemic (the "well-mixed" approximation) we employ a more 



flexible alternative, initially proposed in [36] (and later extended in [2J], 30|). The 
adopted epidemic model has the explicit parameter of "mixing efficiency" (or level of 
social interaction) and is therefore more appropriate to represent the variety of social 
interactions in a small community (self-isolation and panic). A significant advantage of 
the adopted epidemiological model is a rigorous estimation of its noise component (scal- 
ing law of noise level with the population size of a community, see below) resulting in 
more accurate parameter estimations. Furthermore, a more flexible model o f sy ndromic 
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measurements, validated with a limited data set available in the literature 
adopted in the paper. This measurement model is derived from which has been 
extensively validated by observations and is allowed to have imprecisely known parame- 
ters. The optimal sequential estimator (filter) and predictor are then formulated in the 
Bayesian framework and solved using a particle filter. The paper includes numerical re- 
sults that verify the theoretical framework and analyses the sensitivity to partially known 
epidemic parameters. From this perspective, the present study is a significant extension 



of our earlier work (2£ 



2. Problem specification 



This section describes the models to be used in estimation. To describe the dynamics 
of an epidemic outbreak we employ the generalized SIR epidemic model proposed in 
[lol |- S) [iIj 01 with stochastic fiuctuations. According to this model the population of 
the community can be subdivided into three interacting groups: susceptible, infectious 
and recovered individuals. Let the number of susceptible, infectious and recovered be 
denoted by S, I and R, respectively, so that S + I + R = P, where P is the total 
population size. The dynamic model of epidemic progression in time can be be expressed 
by two stochastic differential equations Q], Qli S the "conservation" law for the 
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population as follows: 



ds 

It 
di 
di 



-ais^ + aq^, (1) 

a i s" - pi - aqS, + apC, (2) 
1-s-i. (3) 



The explanation of notation: s ~ S/P, i ^ I / P, r ~ R/P and ^, ( are two uncorrelated 
white Gaussian noise processes, both with zero mean and unity variance. The terms 
(Tq = (Tq{s, i) and ap = (Tp{s, i) are introduced to capture the demographic noise (random 
variations in the contact rate a and in the recovery time (3 between individuals), for 

details see 0, Q> fl- 

The parameters of this model are described next: /3 is the recovery time, that is a 
reciprocal of the average infectious period for the disease; a — p ■ (3, where p is a well 
known parameter in epidemiology, referred to as the the basic reproductive number, which 
represents the average number of people infected by a direct contact with a sick person. 
Finally v is the population mixing parameter, which for a homogeneous population equals 
1. In the presence of an epidemic, v may vary as people may behave unpredictably (panic) 
or avoid the risk of infection (self-isolation) . In general, the epidemic model parameters 
can be assumed to be partially known as interval values, that is a € [a, a], /? £ [/?,/?] 
and u e [j^, I']. 

The amplitude of the noise terms aq ,(Tp in ^ and (0) can be established from a 
scaling law of Gaussian fluctuations generated by the random contact rate q = ai s'^ and 



recovery time /3. Thus for a dynamical system 
individuals P we can write (for details see Q , 



consisting of a large number of 
I, |37|) 



/ .X ais" pi 

<^qis,^) ~ y o-/3(s,j) w y — . (4) 

With further reasonable assumptions s w sq ~ Ij ~ ''o ~ 0, i ~ ~ l/^i that holds 
for the initial stage of epidemic (where sq, ia, J'o are initial values of s,i,r) expressions 
(HI can be reduced to the following scaling for the noise components of the model 

'^q[s,i)K—, <yp(s,i)~—. (5) 



The measurement model is introduced next. Following [l^, [3] we can assume that 
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a power law relationship holds for odds-ratios of syndromic observations and number of 
infected people: 

r_ fix-' 

(6) 



1 — Zj \1 — i ^ 

where Zj is the observable syndrome with index j = 1, . . . , Nz (normalized by population 
size P), qj is the power-law exponent (in general different for different syndromes). Since 
at the initial stage of epidemic z ^ 1, Zj <^1, eq.® can be reduced to a simple power-law 
model 

^bji"' +Tj, (7) 

where bj = const. The noise term Tj is added to simulate random nature of measurement 
(measurements noise) which is assumed to be uncorrelated to other syndromes and noises 
^ and Q. Since Zj > the noise term Tj associated with syndrome j should be modelled 
with a random variable that provides strictly non-negative realisations (e.g. Poissonian 



21j or truncated Gaussian |10{). In the current study we use a random variable that 
obeys the log- normal distribution: this means that we set Tj = CTjrjj, where rjj is the 
standard log- normal noise rfj ^ \nJV{0, 1), see below. 

Parameters bj , aj , typically are not known, but with a representative dataset of 
observations the model ([7]) can be easily calibrated (see results of the linear regression 
fits in 14]). The data fit reported in [13| suggests that may be close to unity (but 
it is not precisely known because of significant scattering of data points). To cater for 
this uncertainty we assume that (,j can take any value in an interval, <; e [£, around 
^ — 1 (for the sake of simplicity we also assume that all i^j ~ <; — const). Unfortunately 



do not report any specific values of fitting parameters, so we have to use some 
heuristic values for bj, aj in our simulations. 

The scaling law for the noise parameter aj in ([7]) can be deduced by employing 
the arguments leading to (O and thus we arrive at the different scaling aj cx {l/Py, 
e ~ min{(<; + l)/2, 1/2}. The later scaling law in conjunction with the scaling laws ([5]) 
provides a consistent way to compare predictive skills of our algorithm for communities 
with different population size and to draw conclusive decisions about its performance 
before any operational deployment. 

Being formulated in the Bayesian framework, our problem is to estimate the (nor- 
malised) number of infected i and susceptible s at time i, using syndromic observations 
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Zj ([7]) collected up to time t. Let x denote the state vector to be estimated; it includes i 
and s, but also the imprecisely known parameters a, /? and v. The formal Bayesian so- 
lution is given in the form of the posterior probability density function (PDF) p{ii.t\zi:t) ^ 
where Xt is the state vector at time t and zi^t denotes all observations up to time t. Using 
the posterior p(xt|zi:t) one can predict the progress of the epidemic using the dynamic 
model ©-©. 



3. Optimal Bayesian solution for imprecise likelihood 



3.1. Formulation 

The model ([II)-® is given in continuous time. For the purpose of computer imple- 
mentation, we require a discrete-time approximation of this model. The state vector is 
adopted as 



a P V 



(8) 



where ^ denotes matrix transpose. Neglecting for the moment the process noise terms, 

the evolution of the epidemic state can be written according to ([Ij-© as x = g(x) 

1 T 



where g(x) 



The nonlinear differential equation 



{as" - -ais" 
governing the evolution of the state cannot be solved in the closed-form. The Euler 
method provides a simple approximation valid for small integration interval r > 0: x(i -|- 
r) w x(t) -|- rg(x(t)). The state-evolution in discrete-time tk can then be expressed as: 



Xfc+i sa ffc(xfc) + Wfc 



(9) 



where k = tk/r is the discrete-time index and ffc(xfc) is the transition function given by 

Xfe[l] +rxfc[l] (xfe[3]xfc[2]-^[^l -xfe[4] 
Xfc[2]-TXfe[3]xfc[l]xfe[2]-'=[5l 
ffe(xfc) = Xfe[3] (10) 

Xfe[4] 
Xfe [5] 

In this notation xj; [i] represents the ith component of vector x^ . Process noise w^: in ([9]) 

is assumed to be zero- mean white Gaussian with a diagonal covariance matrix Q, which 

according to ^ can be expressed as Q w diag[(Q; -f 13)t^/P^, ar'^/P^]. 
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The optimal Bayes filter is typically presented in two steps, prediction and update. 



Suppose the posterior PDF at time tk is given by p(xfc 
computes the PDF predicted to time tm = tk + t as 



zi;k). Then the prediction step 



P{Xm\zi:k) ^ J TT{Xm\xk) P{^k\zi:k) dXk (11) 



where 7r(x|x') is the transitional density. Let Af{y; fi, P) denote a Gaussian PDF with 
mean fi and covariance P. Then according to Q the transitional density is given by 

4x|x') =AA(x;ffc(x'),Q). (12) 

The prediction step is carried out many times with tiny sampling intervals t until an 
observation Zj^k+i becomes available about syndrome j at time tk+i- 

The predicted PDF at tk+i is p{x.k+i\zi-k) ■ In the standard Bayesian estimation 
framework, this PDF is updated using the measurement Zj_k+i by multiplication with 
the measurement likelihood function 16]. According to Q, the likelihood function in 
this case is £{zj^k+i\xk+i) = lnA/'(z; /i(xfe+i; <;), ct|), where /i(xfc;<j) = bj ■ Xfc[l]''. Now, 
the problem is that h{'x.k] <;) defined is this way is not a function because ^ G 

An elegant solution to the imprecise measurement transformation is available in the 
framework of random set theory [20]. In this approach /i(x; <;) defines a closed set Ex on 
the measurement space Z. The closed set Ex is random because the state x is random. 
In fact, a random closed set (RCS) Ex can be seen as a composite mapping. The first is 
the measurable mapping which defines the random variable x : J7 — >■ A", where f2 is the 
sample space and X is the state space. The second mapping is /i(x;^) : X 12, where 
XZ is the set of closed sets of Z. The RCS Ex is therefore a random variable that takes 
values as closed intervals of Z . 

The Bayes updated step using measurement Zj,k+i is now defined as {20! ]: 

, I X _ £>j-,fc+l]xfc+l) ■p(xfc+l]zi:fc) 

P[^k+l\Zl:k+l) - —J ■ ■ — (13) 

J i:(Zj,/c+l]Xfe+l) •p(Xfe+llZl:fc)rfXfc+l 

where £(zj^fe]xfc) is referred to as the generalised likelihood function. For the measurement 
model (O, i{zj^k\'Xk) is defined as [27|: 



i{zj,k\^k) = (t>{z,y,^^^,a]) - 0(zj-fc;Ex,,tT2) (14) 
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where 



Ex = niin{/i(x;£), /i(x; <^)} (15) 
Ex = max{/i(x;£),/i(x;^)}. (16) 

and 0(u; fi, P) = /"^q In A/'(y; /x, P) dy is the cumulative log-normal distribution. 

The recursions of the Bayes filter start with the initial PDF (at time tk = 0), denoted 
p(xo), which is assumed known. 

3.2. Particle filter implementation 

The proposed Bayesian estimator cannot be solved in the closed form. Instead we 
develop an approximate solution based on the particle filter (PF) 28 1. The PF 



approximates the posterior PDF p{'x.k\zi:k) by a weighted random sample, that is: 

s 

p(Xfe|zi:fc) « ^ Wfc^ (5^(0 (x). (17) 
i=l 

Here 5y(x) is the Dirac delta function focused at y, x^*"* is a sample (particle), w^*'' is its 
weight and S is the particle count. As S is increased, approximation p7p becomes more 
accurate. The weights are normalised so that: 

i=l 

The adopted PF implementation is known as the bootstrap PF The algorithm 
starts by forming a random sample from the initial PDF p(xo): Xq*'' ^ p(xo), for i — 
1, . . . , 5. The initial weights are equal, that is Wq*-* = 1/5. The prediction steps consist 
of propagating particles using the dynamic model (jH]). This is implemented by drawing 
samples from the transitional density (jl2p . The prediction steps are carried out until a 
measurement Zj_k becomes available, that is until time tk. Suppose the set of predicted 
particles at time tk is i = 1, . . . , S}. 

The update step is implemented in two stages. First the unnormalised importance 
weights of predicted particles are computed as [28[ : 

ei^^iMLi) (19) 



with generalised likelihood I specified in ([14)) . This is followed by the weight normalisa- 
tion, that is 

4'^ = 4^VE4"^ (20) 

n=l 



for i = 1, . . . , S*. 

The second stage is the resampling step. The role of resampling is to eliminate 
(in the probabilistic manner) the particles with low importance weights and to clone the 
samples with high importance weights. This is carried out by sampling with replacement, 
with the probability of sampling each xJ^H,_-|^ equal to the normalised importance weight 
w"^ . The result is a mapping of a random measure {w^*'' , x^*^j,_j^}^]^ into a new random 
measure with uniform weights: { , x^,*'* . Several computationally efficient resampling 
schemes have been reported; we have implemented the systematic resampling algorithm 

H. 



following the pseudo-code given in Table 3.2 of [28 1 



4. Numerical Results 

4.I. Experimental data set 

The estimation and prediction of an epidemic will be carried out using an experimental 
data set obtained using a large-scale agent based simulated population model 



2M, |33| 



of a virtual town of P = 5000 inhabitants, created in accordance with the Australian 
Census Bureau data. The agent based model is rather complex and takes a long time to 
run. It includes typical age/gender breakdown and family-household- workplace habits 
with the realistic day-to-day people contacts for a disease spread. The blue line in FigH] 
shows the number of people of this town infected by a fictitious disease, reported once 
per day during a period of 154 days (only first 120 days shown). The dashed red line 
represents the deterministic SIR model fit (using the entire batch of 154 data points, and 
setting Wfe = in ([SJ), with estimated parameters a — 0.2399, /3 = 0.1066, = 1.2042. 
The parameter estimates were obtained using importance sampling via the progressive 

n 

correction algorithm ]26]. FigUserves to verify that the generalized SIR model, although 
very simple and fast to run, is remarkably accurate in explaining the data obtained from 
a very complex simulation system (for further details see [36|). 
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Figure 1: The solid line, representing the number of infected people over time, was obtained from the 
agent based simulation population model. The dashed line shows the generalised SIR model fit 

4-2. Estimation and prediction results 

The true number of infected people in simulations was chosen to be the output of 
the agent based simulated population model, shown by the solid line in FiglTJ The 
measurements are generated synthetically in accordance with ([7]), using the following 
parameters: — 1.05 and bj = 0.25, aj — 0.01, for all j — 1, 2, 3, 4 monitored syndromes. 
Independent measurements concerning all Nz — 4 syndromes were available on a daily 
basis during the first 25 days. The problem was to perform estimation sequentially as 
the measurements become available until the day number 25, and after that to predict 
the epidemic size. 

The initial PDF for the state vector was chosen as p(xo) = p(io)p(so)p(ao)p(/^o)p(t'o) 
withp(io) =7V[o,i](zo;Zj,i/6j,cr2),p(so) = 7V[o,i] (sq; l-Zj,i/&j, cr^), p(ao) = ^o.i8,o.50] ("o), 
p(/?o) = ^o.i,o.i43](/?o), pM = ^o.95,i.3](t'o), where 7V[a,6] (x, /i, F) and U[a^b]{x) denote 
a truncated Gaussian distribution restricted at support [a, b] and a uniform distribution 
with limits [a, b], respectively. The number of particles was set to = 10000. 

The first run of the proposed simulation setup was carried out assuming that € 
[1.03, 1.07]. This corresponds to the case of fairly precise knowledge of (recall the true 
value was 1.05). Fig. [2] (a) shows the histograms of particle filter estimated values of a, 
/3 and ly, after processing all 25 days of measurements (i.e. in total 100 measurements, 
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since = 4). This figure reveals that the uncertainty in parameters a and /3 has 
been substantiaUy reduced, compared with initial p(ao) = ^o.i8,o.50] (o^o) and piPo) — 
^0.1.0.143] (/3o)- The uncertainty in i^, on the other hand, has not been reduced, indicating 
that this parameter cannot be estimated. While this is unfortunate, it does not appear 
to be a serious problem since the prior on v in practice is tight (i/ w 1). This is confirmed 
in Figl31(b) which shows a sample of 100 overlayed predicted epidemic curves (gray lines) 
based on the state estimate after 25 days. Figl5](b) indicates the prediction performance; 
the timing of the peak of the epidemic appears to be fairly accurate, while the size of the 
peak is more uncertain. Most importantly, however, the true epidemic curve (solid red 
line) appears to be always enveloped by prediction curves. 

The second run of the proposed simulation setup corresponds to the case with fairly 
imprecise knowledge of that is e [0.85,1.15]. The results are shown in Fig. [S] 
Comparing Figs. [5] and [3] one can make two observations: first, in accordance with our 
expectations, the estimation and prediction results are more uncertain when knowledge 
of <; is imprecise; second, even when knowledge of is imprecise, the curve corresponding 
to the true number of infected people (solid red line) is always enveloped by predictions. 

5. Conclusions 

We have developed an algorithm for syndromic surveillance of an epidemic outbreak 
formulated in the context of stochastic nonlinear filtering. The dynamics of the epidemic 
is modeled using a compartmental epidemiological model with inliomogeneous mixing 
which explicitly includes a parameter responsible for the level of social interactions. This 
model enables simulation of a rich variety of social behavior that may be observed in a 
small commmiity in response to epidemic (i.e. fear, self- isolations, panic). 

The syndromic (typically non-medical) observations are used as an algorithm in- 
put (e.g. web quries. Twitter messages, sales of certain products, absenteeism from 
work/study etc). The algorithm (a particle filter) provides continuous estimation of the 
state of the epidemic, including the number of infected people and the unknown param- 
eters of the model. The numerical results indicate that the proposed framework can 
provide useful early prediction of the epidemic peak if the uncertainty in prior knowledge 

of model parameters (including social behavior) is not excessive. 
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(b) 



Figure 2: Estimation/prediction results from the particle filter after processing measurements collected 
over 25 days of surveillance assuming that ? £ [1.03, 1.07] (true value ? = 1.05): (a) the histograms of 
estimated parameters a, /3 and v (true values indicated by vertical red lines); (b) Prediction results for 
a random sample of 100 particles (gray lines); the red line is the experimental curve from Fig^ 
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Figure 3: Estimation/prediction results from the particle filter after processing measurements collected 
over 25 days of surveillance assuming that ? £ [0.85, 1.15] (true value ? = 1.05): (a) the histograms of 
estimated parameters a, /3 and v (true values indicated by vertical red lines); (b) Prediction results for 
a random sample of 100 particles (gray lines); the red line is the experimental curve from Fig^ 
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Our future work would encompass a generalization of the proposed framework to 
include detection and prediction of other social and behavioral processes that can be 
described by similar population models (computer worm attacks, propaganda campaigns, 
complex social engagements, sensor networks, quorum sensing, etc, see 
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